Microbial genome sequence submissions to the International Nucleotide Sequence Database 
Collaboration (INSDC) have been annotated with organism names that include the strain 
identifier. Each of these strain-level names has been assigned a unique 'taxid' in the NCBI 
Taxonomy Database. With the significant growth in genome sequencing, it is not possible to 
continue with the curation of strain-level taxids. In January 2014, NCBI will cease assigning 
strain-level taxids. Instead, submitters are encouraged to provide strain information and rich 
metadata with their submission to the sequence database, BioProject and BioSample. 



Toward richer metadata for microbial 
sequences 

The NCBI taxonomy database provides the organ- 
ism nomenclature and classification that is used in 
sequence entries by the International Nucleotide 
Sequence Database Collaboration (INSDC [1]; 
comprising GenBank, ENA and the DDBJ) [2]. The 
NCBI Taxonomy Group is responsible for curating 
names for taxa that are regulated by the relevant 
codes of nomenclature [3-5], for providing infor- 
mal names for specimens that are not identified 
with Linnaean species binomials, and for main- 
taining the 'taxid' namespace. This is a labor- 
intensive and largely manual effort undertaken by 
this small group of diligent and dedicated taxon- 
omists at the NCBI [6]. 

It has been almost twenty years since the first bac- 
terial genomes started to appear in the sequence 
databases, beginning with Haemophilus influenzae 
in 1995, followed within a year by Escherichia coli. 
In those days each new genome sequence was of 
significant scientific interest and represented a 
considerable technical achievement. At that time, 
for the convenience of those at INSDC institutes 
and their users, the taxonomy group started as- 
signing strain-level taxids for prokaryotes with 
complete genome sequences, e.g.: "Haemophilus 
influenzae Rd" [7] and "Escherichia coli K12" [8]. 
(That genome is currently indexed as "Escherichia 
coli str. K-12 substr. MG1655", since there are now 
many genomes sequenced from 'strain' K-12.) 
Since that time, the policy of assigning strain-level 
taxids for genome sequences has been extended to 
cover eukaryotic microbes as well - unicellular 
fungi, algae and protists - but it has never been 
applied to the multicellular eukaryotes. In particu- 
lar, strain-level taxids have never been assigned 
for breeds of dogs, or for inbred strains of mice, or 
for individual human genomes. 

Sequencing technology has undergone remarkable 
development over the past twenty years and it has 
become increasingly cheap and easy to sequence 
genomes, a trend that promises to continue in the 
foreseeable future. We are already seeing the 
submission of hundreds of genomes at a time that 
are simply time points of micro-evolutionary studies 
in Escherichia coli or Saccharomyces cerevisiae. 
Another growing industry in genome submissions 
is in efforts to track epidemics, food-borne illness- 
es and hospital infection pathways. More will ap- 
pear as this technology finds applications in other 
fields. 

Our recognition that the curation of strain-level 
taxids will not remain possible under such growth, 
and that alternative data resources relating to bio- 
logical samples are maturing at the INSDC partner 
institutes, has led us to a review of our practices in 
this area. We intend to discontinue the curation of 
strain-level taxids for microbial genomes submitted 
after January 2014. Importantly, this change 
in practice will not be applied retrospectively; we 
will not remove any of the thousands of strain- 
level nodes that we have added in the past, and we 
will continue to add informal strain-specific 
names for genomes from specimens that have not 
been identified to the species level, e.g.: 
"Rhizobium sp. CCGE 510" and "Salpingoeca sp. 
ATCC 50818". 

We strongly encourage submitters to annotate 
their genome submissions with the relevant 
source metadata, including strain, culture collec- 
tion and isolation information as appropriate, plus 
the appropriate species (or subspecies) name. The 
Genomic Standards Consortium maintains checklists 
of Minimal Information about any (x) Sequence 
(MIxS) [9] that contain mandatory and optional 
descriptive metadata fields for a variety of 
organism types. These MIxS checklists can be in- 
cluded in the genome submission. 
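
As a rough illustration of how a checklist with mandatory and optional fields might be applied before submission, the following Python sketch checks a metadata record against a small checklist. The field names and the check_metadata helper are hypothetical; they are not taken from any published MIxS checklist.

```python
# Hypothetical checklist: these field names are illustrative only and are
# not copied from any published MIxS checklist.
MANDATORY_FIELDS = {"organism", "strain", "collection_date", "geographic_location"}
OPTIONAL_FIELDS = {"culture_collection", "isolation_source", "host"}


def check_metadata(record):
    """Return (missing, unknown) field names for a metadata record.

    A mandatory field counts as missing when it is absent or blank.
    """
    provided = {key for key, value in record.items() if str(value).strip()}
    missing = sorted(MANDATORY_FIELDS - provided)
    unknown = sorted(provided - MANDATORY_FIELDS - OPTIONAL_FIELDS)
    return missing, unknown


# Example record with two mandatory fields absent and one unrecognized field.
record = {
    "organism": "Escherichia coli",
    "strain": "K-12",
    "isolation_source": "laboratory culture",
    "sequencing_centre": "example lab",
}
missing, unknown = check_metadata(record)
print("missing mandatory fields:", missing)
print("fields not in checklist:", unknown)
```

A check of this kind only reports which fields are present; it does not validate the values themselves.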

Our alternative system for recording and present- 
ing strain-level annotation will be provided by the 
respective BioSample databases of the INSDC 
partner institutes [10-12]. BioSample records 
provide a single accessioned unit of information 
relating to a sample that has been assayed using 
sequencing or other platforms. This information 
serves to gather together taxonomic information, 
informal infraspecies information (such as strain), 
descriptors relating to the sampling process, ac- 
cession information for the physical sample itself, 
etc. 

For genome submissions, INSDC databases guide 
submitters through a series of logical steps in 
which the information required is requested and 
transferred. An early step is the registration of the 
initiative (BioProject) or indication that the genome 
data are connected to an existing initiative 
(this registration is applied within the INSDC host 
institutes' respective BioProject and study databases). 
Following this, users are prompted to provide 
rich descriptive information about the sequenced 
sample(s) (BioSample) or an indication 
that samples already registered have been se- 
quenced. Description of new samples, and updates 
and enhancements to existing samples, take ad- 
vantage of defined checklists or 'packages' of at- 
tributes, appropriate for the initiative. In later 
steps of the genome submission process, users 
provide sequence data and functional annotation 
that connect to the samples described or selected. 

BioSample records are one tool that can be used 
as an organizing and retrieval key to the genome 
datasets, as the strain-level taxid was in the past. 
BioSample accessions can be used to aggregate 
submitted data deposited in various archives, such 
as those that cover sequence (i.e. INSDC) and 
those that cover array-based studies (such as GEO, 
ArrayExpress and the DDBJ Omics Archive) [12-14]. 
The BioSample record will enable users to retrieve 
data across databases from samples with 
particular attributes. For instance, one may wish 
to retrieve submitted data for all Salmonella 
enterica strains isolated from a particular agricultural 
plant. 
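
The kind of attribute-based retrieval described above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the interface of any INSDC database: the SampleRecord type, the find_samples helper, the attribute names and the accessions are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """Simplified, hypothetical stand-in for a BioSample-style record."""
    accession: str
    organism: str
    attributes: dict = field(default_factory=dict)


def find_samples(records, organism, **required):
    """Return records for an organism whose attributes match every given value."""
    return [
        rec for rec in records
        if rec.organism == organism
        and all(rec.attributes.get(key) == value for key, value in required.items())
    ]


# Placeholder records; the accessions and attribute values are invented.
records = [
    SampleRecord("SAMPLE_A", "Salmonella enterica",
                 {"strain": "s1", "isolation_source": "plant X"}),
    SampleRecord("SAMPLE_B", "Salmonella enterica",
                 {"strain": "s2", "isolation_source": "plant Y"}),
    SampleRecord("SAMPLE_C", "Escherichia coli",
                 {"strain": "s3", "isolation_source": "plant X"}),
]

hits = find_samples(records, "Salmonella enterica", isolation_source="plant X")
print([rec.accession for rec in hits])
```

The accessions returned by such a query could then serve as keys for collecting the associated data from each archive.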

INSDC assembly records are another powerful 
tool in this area, as these hold the information 
about a particular genome assembly and are sup- 
ported with unique assembly-level identifiers. In 
these records all of the pieces of a genome are col- 
lected together in ways that are much more flexi- 
ble and powerful for indexing and retrieval pur- 
poses than were strain-level taxids. For example, 
genomes representing independent assemblies of 
the same sequence data share a BioSample acces- 
sion, while those representing alternative se- 
quencing studies of the same strain may have in- 
dependent BioSample accessions. 

The Streptococcus pneumoniae TIGR4 (taxid 
170187) genome initiative is described in 
PRJNA76613. This record contains two genome 
assemblies that were built from sequence reads 
from a single BioSample, SAMN00103527. Two 
different assembly algorithms were used to create 
the assemblies, which are detailed in 
GCA_000269665 and GCA_000273445. 
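
The relationship between assemblies and BioSample accessions illustrated above can be sketched as a simple grouping step. The Python below is a minimal illustration under stated assumptions: the group_by_sample helper is hypothetical, and the assembly and sample identifiers are invented placeholders rather than real accessions.

```python
from collections import defaultdict

# Placeholder (assembly, sample) pairs; these identifiers are invented and
# do not correspond to real INSDC accessions.
assemblies = [
    ("ASM_1", "SAMPLE_A"),  # two independent assemblies of the same reads
    ("ASM_2", "SAMPLE_A"),
    ("ASM_3", "SAMPLE_B"),  # a separate sequencing study of the same strain
]


def group_by_sample(pairs):
    """Map each sample accession to the list of assemblies built from it."""
    grouped = defaultdict(list)
    for assembly, sample in pairs:
        grouped[sample].append(assembly)
    return dict(grouped)


by_sample = group_by_sample(assemblies)
for sample, members in by_sample.items():
    print(sample, members)
```

Grouping on the sample accession separates alternative assemblies of one dataset from independent sequencing studies, a distinction that a single strain-level taxid could not express.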

In an era when microbial genome sequencing was 
not as commonplace as it is now, using a taxid as a 
key to retrieve the genome and associated project 
metadata was a reasonable approach. However, 
with next-generation sequencing technology, one 
can sequence the genomes of hundreds of closely 
related microbes in a few hours [15]. Therefore, 
data consumers are better served by the new re- 
sources that we describe above that enable them 
to retrieve sets of genomes based on common at- 
tributes or initiatives. 

The INSDC is prepared to stop assigning strain-level 
taxids for newly sequenced microbial strains by 
January 2014, and encourages users to exploit 
other resources that allow them to explore 
sequence data by initiative, specimen or genome 
assembly. 